Using recurrent neural networks to detect supernumerary chromosomes in fungal strains causing blast diseases

Abstract The genomes of the fungus Magnaporthe oryzae that causes blast diseases on diverse grass species, including major crops, have indispensable core-chromosomes and may contain supernumerary chromosomes, also known as mini-chromosomes. These mini-chromosomes are speculated to provide effector gene mobility, and may transfer between strains. To understand the biology of mini-chromosomes, it is valuable to be able to detect whether a M. oryzae strain possesses a mini-chromosome. Here, we applied recurrent neural network models for classifying DNA sequences as arising from core- or mini-chromosomes. The models were trained with sequences from available core- and mini-chromosome assemblies, and then used to predict the presence of mini-chromosomes in a global collection of M. oryzae isolates using short-read DNA sequences. The model predicted that mini-chromosomes were prevalent in M. oryzae isolates. Interestingly, at least one mini-chromosome was present in all recent wheat isolates, but no mini-chromosomes were found in early isolates collected before 1991, indicating a preferential selection for strains carrying mini-chromosomes in recent years. The model was also used to identify assembled contigs derived from mini-chromosomes. In summary, our study has developed a reliable method for categorizing DNA sequences and showcases an application of recurrent neural networks in predictive genomics.

The genome of M. oryzae contains seven essential core-chromosomes, and many genomes possess one or a few extra, non-essential supernumerary chromosomes (13, 14). Besides M. oryzae, many plants, animals, and other fungi carry supernumerary chromosomes, which are also known as extra chromosomes, dispensable chromosomes, accessory chromosomes, or B-chromosomes. Supernumerary chromosomes are hypothesized to be an accelerator for fungal adaptive evolution (15). Supernumerary chromosomes in M. oryzae are referred to as mini-chromosomes because their sizes are typically smaller than core-chromosomes (14, 16, 17). As compared to core-chromosomes, mini-chromosomes in M. oryzae are more repetitive, containing more transposable elements and fewer genes. The repeat-rich characteristic provides ample intrachromosomal homology for DNA duplication, loss, and rearrangements, creating conducive environments to accelerate genome evolution (14, 18). Indeed, mini-chromosomes are highly variable among M. oryzae strains (8, 14, 17). Mini-chromosomes carry effector genes that can be found in core-chromosomes in different strains, suggesting crosstalk between mini- and core-chromosomes (14, 17, 19). Therefore, mini-chromosomes are thought to be capable of mediating the mobility of effector genes, facilitating fungal adaptation.
To confirm and further understand the evolutionary role of mini-chromosomes, it is critical to be able to determine if a particular M. oryzae strain carries a mini-chromosome. Contour-clamped homogeneous electric field (CHEF) electrophoresis of intact chromosomes is the means to provide conclusive evidence for the presence or absence of chromosomes with sizes smaller than core-chromosomes (16, 20). However, the technique is laborious and requires specific equipment. This is a significant hurdle as research labs may not have access to all strains published in the literature, especially given that MoT wheat blast-causing strains are quarantined in the United States. A reliable method to determine the presence of a mini-chromosome from publicly available or newly generated sequencing data would be fast, cost-effective, and decentralized. A simple strategy is to align sequencing reads to known mini-chromosome genomes to determine the proportion of mini-chromosome genomes supported by reads. However, the lack of knowledge about critical elements required for mini-chromosomes, the high level of variability among mini-chromosomes, and the potential exchanges between core- and mini-chromosomes complicate the analysis. Fortunately, multiple complete core-chromosome and mini-chromosome genomes are currently available, providing the opportunity to deploy deep learning algorithms to learn features of core- and mini-chromosome sequences for prediction.
Use of neural network-based deep learning techniques has rapidly increased due to the availability of large data and their ability to find complex patterns. Recurrent Neural Network (RNN) and, specifically, Long Short-Term Memory (LSTM) networks have been used in genomics to utilize the sequential property of DNA sequences for making various predictions (21-23). An RNN represents a group of artificial neural networks that incorporate feedback connections to retain and utilize information from prior input events as activation. These networks leverage their internal state to process input sequences of varying lengths. However, training RNNs to effectively capture long-term dependencies poses challenges, as the error signals flowing backward often suffer from either explosive amplification or rapid attenuation, also known as exploding or vanishing gradients (24, 25). To address this problem, the LSTM architecture was introduced as an extension to the vanilla RNN, a simple form of RNN (24). Another enhancement is the Bidirectional LSTM (Bi-LSTM), which considers sequential context from both directions and can improve performance (26, 27). In our study, we apply Bi-LSTM deep learning to predict the presence of mini-chromosome sequences based on genomic sequence data. Experimental results show that a Bi-LSTM neural network model can accurately infer the presence of mini-chromosomes in strains of M. oryzae.

Contour-clamped homogeneous electric field (CHEF) electrophoresis of TF05-1
TF05-1 protoplasts were prepared with a procedure slightly modified from the approach used in Orbach et al. (16). Briefly, harvested mycelia were washed with 1 M sorbitol and digested with 10 mg/ml Lysing Enzymes from Trichoderma harzianum (Sigma Aldrich, CAT#L1412) in 1 M sorbitol at 28 °C, 90 rpm for 2.5 h. The digested product was filtered through sterile Nytex nylon mesh and centrifuged at 4500 rpm at 4 °C for 10 min to collect protoplasts. Protoplasts were washed with SE buffer (1 M sorbitol, 50 mM EDTA) and adjusted to 1 × 10^9 cells/ml. The CHEF Genomic DNA Plug Kit (Bio-Rad, CAT#1703591) was used for the preparation of protoplast plugs. Protoplasts were mixed with a 2% low melting agarose gel and transferred to molds to form protoplast plugs. After incubating in the proteinase K buffer overnight at 50 °C, plugs were washed four times with 1× wash buffer at 25 °C, and then stored in 0.5× TBE at 4 °C. A CHEF Mapper XA System (Bio-Rad, CAT#1703671) was used for CHEF gel electrophoresis using 0.7% Certified Megabase Agarose in 0.5× TBE buffer. The electrophoresis was run at 1.5 V/cm and 6 °C, with switch times ranging from 1200 to 4800 seconds for 120 h.

Identification of common sequences between core- and mini-chromosomes
Alignment was performed between core- and mini-chromosomes for each of B71, O135, LpKY97 and TF05-1 with NUCmer (31). Alignments with at least 105 bp matches and 95% identity were retained. Alignment regions were merged if neighboring alignments were within a 100 bp distance, and sequences were extracted from both core- and mini-chromosomes. All common sequences identified from these four genomes and the mitochondrial sequence of B71 were combined to form a database of sequences excluded from the training.
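The merging step above can be illustrated with a minimal sketch. It assumes alignment regions are given as (start, end) coordinates on one chromosome and that "distance" means the gap between the end of one region and the start of the next; the function name `merge_nearby` and the example coordinates are hypothetical and not taken from the authors' pipeline.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) coordinates on one chromosome


def merge_nearby(intervals: List[Interval], max_gap: int = 100) -> List[Interval]:
    """Merge intervals whose gap to the previously merged interval is <= max_gap bp.

    Assumption: the gap is measured as start - previous_end; overlapping
    intervals (negative gap) are merged as well.
    """
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical alignment regions, for illustration only.
example = [(1000, 1200), (1250, 1400), (2000, 2300), (2390, 2500), (5000, 5100)]
merged_example = merge_nearby(example)
print(merged_example)
```

In this toy input, the first two regions (50 bp apart) and the third and fourth regions (90 bp apart) are merged, while the last region stays separate.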

Bi-LSTM models
The Bi-LSTM model was implemented using Python with the TensorFlow and Keras libraries. The architecture consisted of an input layer, hidden LSTM layers, and an output layer. The input layer consisted of an embedding layer that encodes the input tokens (such as the eleven 9-mer tokens of a 99 bp sequence) into vectors of size 128. The hidden layers consisted of two Bidirectional LSTM layers, each with 256 hidden units, stacked on top of each other with the hyperbolic tangent (tanh) activation function. The output of the last Bi-LSTM hidden layer was connected to a dense output layer with a sigmoid activation function. A model for 99 bp sequences with eleven 9-mer tokens contained a total of 34 212 225 trainable parameters. The selection of the model architecture and hyperparameters was informed by experimenting with a range of values and selecting those that resulted in the best performance on the validation set.
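As a sanity check on the architecture description, the trainable parameter count can be reproduced with the standard LSTM parameter formula. The sketch below rests on two assumptions that are not stated in the text: the embedding vocabulary holds all 4^9 possible 9-mers plus one reserved index, and the "256 hidden units" of each Bi-LSTM layer are split as 128 units per direction. Under these assumptions the arithmetic yields exactly the reported 34 212 225 parameters; reading 256 as units per direction gives a different total.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Parameters of one LSTM: four gates, each with input weights,
    recurrent weights, and a bias vector."""
    return 4 * (input_dim + units + 1) * units


def bilstm_params(input_dim: int, units_per_direction: int) -> int:
    """A bidirectional layer holds two independent LSTMs."""
    return 2 * lstm_params(input_dim, units_per_direction)


K = 9
EMBED_DIM = 128
VOCAB_SIZE = 4 ** K + 1          # assumption: all 9-mers plus one reserved index
UNITS_PER_DIRECTION = 128        # assumption: 256 hidden units = 2 x 128

embedding = VOCAB_SIZE * EMBED_DIM
layer1 = bilstm_params(EMBED_DIM, UNITS_PER_DIRECTION)
layer2 = bilstm_params(2 * UNITS_PER_DIRECTION, UNITS_PER_DIRECTION)
dense = 2 * UNITS_PER_DIRECTION + 1  # sigmoid output unit: weights plus bias

total = embedding + layer1 + layer2 + dense
print(total)
```

Almost all parameters sit in the embedding table (about 33.6 million), which is why the choice of k dominates model size.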
For training the Bi-LSTM model, backpropagation through time (BPTT) was employed, using binary cross entropy as the loss function. The optimization was performed using the Adam optimizer with a learning rate of 0.001. The training and validation data sets, such as 99 bp sequences with eleven 9-mers labeled with either 'core' or 'mini', were encoded using one-hot encoding and used to train and evaluate the model. The dataset was split into train, validation, and test sets with an 80/10/10 split. To optimize model performance, a large training dataset is required, while the validation and test sets must still closely resemble the overall data distribution. With a large dataset like ours, both requirements can be met by using a small percentage of the data as the validation and test sets. A mini-batch size of 2048 was used for training and evaluation at each epoch. To optimize the training and prevent overfitting, an early stopping criterion based on validation loss was implemented. The model was trained for a maximum of 150 epochs with a patience of 15 epochs: if the validation loss did not improve over 15 consecutive epochs, the training was stopped. The model weights corresponding to the lowest validation loss were restored, representing the best-performing model. The final trained model was then tested on an independent test dataset to evaluate its overall performance.
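The early stopping rule can be sketched in a few lines of plain Python. This is an illustration of the logic only, operating on a list of per-epoch validation losses; the function name `early_stopping` and the loss values are hypothetical. In Keras, comparable behavior is commonly obtained with the `EarlyStopping` callback (`monitor="val_loss"`, `patience=15`, `restore_best_weights=True`), although the text does not state which mechanism the authors used.

```python
from typing import List, Tuple


def early_stopping(val_losses: List[float],
                   patience: int = 15,
                   max_epochs: int = 150) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once `patience` consecutive epochs pass without a new
    lowest validation loss, or when `max_epochs` is reached. The weights
    from `best_epoch` are the ones that would be restored.
    """
    best_loss = float("inf")
    best_epoch = 0
    stop_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        stop_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, stop_epoch


# Hypothetical loss curve: improves for 4 epochs, then plateaus.
losses = [1.0, 0.8, 0.6, 0.5] + [0.55] * 30
best_epoch, stop_epoch = early_stopping(losses)
print(best_epoch, stop_epoch)
```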

Illumina WGS short-reads of 252 M. oryzae isolates
WGS reads were downloaded from the Sequence Read Archive (SRA). Data of 252 accessions were collected (Supplementary Table S1). Reads were trimmed with Trimmomatic prior to further analyses (32).

Subsampled reads for determining miniC proportions
Random seeds were set for sampling reads from the forward reads of the original paired-end Illumina reads of the isolates of P3, B71, T25 and Guy11.Subsampling was implemented using seqtk (version 1.2).Subsampled reads were then used for the prediction with the optimized Bi-LSTM model.

Indexes of similarity to the B71 mini-chromosome
WGS reads of each strain were compared with WGS reads of B71 to infer the genomic regions of the B71 mini-chromosome that were absent in the isolate through Comparative Genomics Read Depth (CGRD) (14, 33). Each CGRD analysis may identify B71 mini-chromosome regions that were absent in the analyzed strain. The proportion of the B71 mini-chromosome that was not detected as absent regions represents the portion of sequences similar to the B71 mini-chromosome, referred to as the index of similarity to the B71 mini-chromosome of the strain. A low value of the index indicates the absence of a mini-chromosome.
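The index described above is one minus the fraction of the B71 mini-chromosome covered by absent regions. The sketch below assumes absent regions arrive as half-open (start, end) intervals that may overlap; the function name `similarity_index`, the chromosome length, and the regions are hypothetical values for illustration and do not come from CGRD output.

```python
from typing import List, Tuple


def similarity_index(absent_regions: List[Tuple[int, int]], chrom_length: int) -> float:
    """Fraction of the reference mini-chromosome NOT covered by absent regions.

    Regions are half-open (start, end) intervals; overlaps are counted once.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(absent_regions):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return 1.0 - covered / chrom_length


# Hypothetical 1 Mb reference with 500 kb reported absent (two regions overlap).
index = similarity_index([(0, 300_000), (250_000, 400_000), (900_000, 1_000_000)],
                         chrom_length=1_000_000)
print(index)
```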

Genome sequencing and assembly of an early MoT strain T3
The MoT strain T3 was cultured on oatmeal agar (OMA) plates followed by liquid culture in a Biosafety Level 3 (BSL3) laboratory in the Biosecurity Research Institute (BRI) at Kansas State University in Manhattan, KS (8, 34). The detailed procedure for genomic DNA extraction was previously described (8). Briefly, mycelial mats were collected, lyophilized, and ground for DNA extraction with a CTAB approach. DNA was stored in TE buffer containing 1 mg/ml RNase. Approximately 50× paired-end (2 × 150 bp) Illumina data were produced at Novogene USA. Nanopore long reads were generated using the same genomic DNA per the procedure described previously (8). The genomic DNA was subjected to size selection (> 20 kb) using a BluePippin Gel Cassette (Sage Science, USA, Cat.# BLF7510), followed by library construction using the SQK-LSK110 kit and sequencing using an R9.4.1 flow cell on a MinION Mk1B device (Oxford Nanopore, UK). Nanopore raw FAST5 data were converted to FASTQ reads using the Guppy Basecaller (version 6.3.2). Reads were assembled with Canu (version 2.2) with the parameters 'genomeSize=45m minReadLength=10000 minOverlapLength=1000 correctedErrorRate=0.08 rawErrorRate=0.3 corOutCoverage=60' (35). The contigs in Canu assemblies were aligned to B71Ref2 to determine the chromosome number and the orientation using NUCmer with the parameters '-L10000 -I 90' (31). The resulting assembly was polished using Nanopolish (version 0.14.0) and then using Pilon (version 1.24) with Illumina reads (36, 37).

Results
Training data to assign DNA sequences to core- or mini-chromosomes
The goal of the study was to predict the presence of mini-chromosomes using genomic sequencing data. Although genomic data of hundreds of M. oryzae strains are publicly available, there is very little data regarding whether an individual strain possesses mini-chromosome(s). We addressed this by building a Bi-LSTM model to classify DNA sequences as originating from core- or mini-chromosomes (Figure 1). The output of the model is used to infer the presence of mini-chromosomes in a strain based on the proportion of mini-chromosome-derived sequences among the total short DNA sequences examined. We collected finished genome assemblies of M. oryzae strains with or without mini-chromosomes for model training, from which short sequences were extracted. The strains harboring at least one mini-chromosome include B71 (MoT) (14), LpKY97 (MoL) (38), TF05-1 (MoL) (Supplementary Figure S1) and O135 (MoO) (16), while the strains containing no mini-chromosomes include the MoO reference strain 70-15 (13) and MZ5-1-6 (MoE) (30). Approximately 11.2 and 252.2 Mb from mini- and core-chromosomes, respectively, were collected for model training (Table 1). Note that the presence of at least one mini-chromosome in B71, TF05-1 and O135 was verified by CHEF (14, 16, 29). The collected six mini-chromosomes and 42 core-chromosomes were fragmented into short DNA sequences and labeled with either mini or core as the sequence source for model training.

Training of Bi-LSTM models
Short sequences (e.g. 99 bp) extracted from the six mini-chromosomes and 42 core-chromosomes were termed subsequences (Figure 1A). Each subsequence was then tokenized into non-overlapping k-mers (e.g. 9-mers). Afterwards, the tokenized data were split into train, validation, and test sets with an 80/10/10 split. Models were trained on the train set and evaluated for training performance and hyperparameter selection using the validation set. The selected models were finally evaluated using the test set. When constructing the training dataset, we encountered imbalanced training sequence data from core- and mini-chromosomes, in which the total length of core-chromosomes was markedly larger than the total length of mini-chromosomes (Table 1). To create balanced training data from core- and mini-chromosomes, we extracted subsequences with a step size of 1 bp from mini-chromosomes and a step size of 27 bp from core-chromosomes. Models were trained with DNA sequence data of different k-mer sizes, ranging from 5- to 11-mers, and subsequence lengths. Lengths of subsequences were limited to around 100 bp because the read lengths of whole genome sequencing (WGS) data of most M. oryzae strains to be used for the prediction are around 100-150 bp. Overall, the evaluation on the validation data showed that models trained with 9-mers attained the highest scores in both accuracy and precision, and the recall score was close to the highest score, achieved using 11-mers (Figure 1B-D, Supplementary Table S2).
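The subsequence extraction and tokenization described in this section can be sketched as follows. This is a minimal illustration on a toy sequence, assuming a fixed window of 99 bp, a configurable step size, and non-overlapping 9-mers; the function names `subsequences` and `tokenize` are hypothetical. As rough arithmetic on the totals given in the text, 252.2 Mb sampled every 27 bp gives about 9.3 million core windows, comparable to the roughly 11.2 million windows obtained from 11.2 Mb of mini-chromosome sequence at a 1 bp step.

```python
from typing import Iterator, List


def subsequences(seq: str, length: int = 99, step: int = 1) -> Iterator[str]:
    """Yield fixed-length windows of `seq`, advancing `step` bp each time."""
    for start in range(0, len(seq) - length + 1, step):
        yield seq[start:start + length]


def tokenize(subseq: str, k: int = 9) -> List[str]:
    """Split a subsequence into non-overlapping k-mers (any remainder is dropped)."""
    usable = len(subseq) - len(subseq) % k
    return [subseq[i:i + k] for i in range(0, usable, k)]


toy = "ACGT" * 40                                  # 160 bp toy sequence
mini_windows = list(subsequences(toy, step=1))     # dense sampling, as for mini-chromosomes
core_windows = list(subsequences(toy, step=27))    # sparse sampling, as for core-chromosomes
tokens = tokenize(mini_windows[0])                 # eleven 9-mers for a 99 bp window

print(len(mini_windows), len(core_windows), len(tokens))
```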
The assessment of the models on the test data set gave consistent results (Supplementary Table S3). We previously showed that common sequences, particularly transposable elements, occurred in both core- and mini-chromosomes (14), which created ambiguous sequence examples that did not have a clear class distinction. To examine whether the occurrence of these common sequences impacted the model, we re-trained models using genomic data where common sequences were identified per strain and removed from the training data. Model performance on the validation data improved when these sequences were removed (Figure 1B-D), and the model trained using 9-mers became the best for accuracy, precision and recall (Supplementary Table S4). Among the 9-mer models, the two subsequence lengths of 99 bp and 108 bp differed in performance, and the model trained using the 99 bp subsequences (eleven 9-mers) attained better scores: 98.9% accuracy, 97.0% precision and 98.7% recall on both the validation and test datasets (Supplementary Figure S2, Supplementary Tables S4, S5). This model was used for subsequent analysis.

Survey of presences of mini-chromosomes in cereal blast strains
The optimized Bi-LSTM model was used to examine the presence of mini-chromosomes in M. oryzae isolates whose WGS data were available. The probability of mini-chromosome origin for each WGS read was estimated, and reads with a prediction probability larger than 0.99 were classified as mini-chromosome reads (Supplementary Figure S3). The proportion of mini-chromosome reads among all examined reads, referred to as the miniC proportion, was determined for each M. oryzae isolate. In total, WGS data of 252 M. oryzae isolates from multiple pathotypes were analyzed, resulting in miniC proportions ranging from 0.7% to 9.3% (Figure 2A, Supplementary Table S6). Three isolates, B71 (MoT), P3 (MoT) and LpKY97 (MoL), carrying at least one mini-chromosome had miniC proportions of 3.5%, 5.8% and 5.6%, respectively. Note that the P3 and LpKY97 genomes each contained two mini-chromosomes based on previous reports (14, 28). In contrast, the miniC proportions of four isolates with no mini-chromosomes, 70-15, Guy11 (MoO), MZ5-1-6 (MoE) and T25 (MoT), were 0.8%, 0.9%, 1.1% and 0.9%, respectively. Based on the miniC proportions of these isolates, we used 1.5% as the miniC proportion threshold to classify isolates as with or without mini-chromosomes.
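The two-stage decision described above (a per-read probability cutoff of 0.99, then a 1.5% threshold on the miniC proportion) can be sketched as below. The input is assumed to be a list of per-read probabilities from the classifier; the function names and the example probabilities are hypothetical, and treating the isolate threshold as a strict "greater than" is an assumption.

```python
from typing import Iterable


def minic_proportion(probabilities: Iterable[float], prob_cutoff: float = 0.99) -> float:
    """Percentage of reads whose mini-chromosome probability exceeds the cutoff."""
    probs = list(probabilities)
    if not probs:
        raise ValueError("no reads were provided")
    mini_reads = sum(1 for p in probs if p > prob_cutoff)
    return 100.0 * mini_reads / len(probs)


def has_mini_chromosome(minic_percent: float, threshold: float = 1.5) -> bool:
    """Classify an isolate; assumes values strictly above the threshold count as positive."""
    return minic_percent > threshold


# Hypothetical per-read probabilities for two isolates of 1000 reads each.
isolate_a = [0.999] * 35 + [0.50] * 965   # 3.5% of reads above the cutoff
isolate_b = [0.995] * 8 + [0.20] * 992    # 0.8% of reads above the cutoff

prop_a = minic_proportion(isolate_a)
prop_b = minic_proportion(isolate_b)
print(prop_a, has_mini_chromosome(prop_a))
print(prop_b, has_mini_chromosome(prop_b))
```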
The Comparative Genomics Read Depth (CGRD) pipeline was employed to identify the genomic regions of the B71 mini-chromosome that were absent in each isolate (14, 33). The proportion of the B71 mini-chromosome that was not detected as absent regions represents the similarity of the potential mini-chromosome of an isolate to the B71 mini-chromosome, which was referred to as the index of similarity to the B71 mini-chromosome. Index values of the 252 strains ranged from 0.03 to 1. A higher index of similarity indicates a higher possibility that an isolate carries mini-chromosome(s). Based on the index values of isolates known to carry at least one mini-chromosome or none (Supplementary Table S6), an index threshold of 0.2 was used to classify isolates as with or without a mini-chromosome.
Comparison of the prediction results from the Bi-LSTM model with the results from the CGRD approach showed that the two methods were highly consistent. Specifically, the predictions of mini-chromosome presence were the same for 98.4% (248/252) of isolates. In total, 223 isolates were predicted to contain mini-chromosome(s) using both approaches, indicative of a substantial presence of mini-chromosomes across M. oryzae strains. The results also indicated that different pathotypes had varying levels of mini-chromosome prevalence (Figure 2B). More than 90% of both 196 MoO and 25 MoT isolates were predicted to carry mini-chromosomes. All isolates collected from Avena spp., Cenchrus spp., Lolium spp. and Urochloa spp., and half of the isolates from Digitaria spp. and Setaria spp., were predicted to contain mini-chromosomes. Mini-chromosomes were the least prevalent in isolates from Eleusine spp., of which only 29% (2/7) were predicted to contain mini-chromosomes. Note that the number of isolates of each of the pathotypes other than MoO and MoT is relatively small, ranging from 2 to 7. Two MoO isolates, namely IR0095 and JP0091, proved difficult to predict and produced different predictions from the two prediction approaches. The miniC proportions of the two isolates were 1.1%, while the indexes of similarity to the B71 mini-chromosome were 0.211 (IR0095) and 0.278 (JP0091). Both predictions for the two strains were close to the respective thresholds.
Rice isolates (MoO) were previously classified into four clades (29). The classified strains with whole genome sequencing reads longer than 100 bp were subjected to the miniC analysis. The prediction showed that 75% (9/12) of isolates from clade I contain no mini-chromosomes, and all isolates (N = 55) from clades II, III and IV contain mini-chromosomes with one exception in clade II (Supplementary Table S7).

Prediction using subsets of reads
To determine the minimal number of sequencing reads required for reliable prediction, four isolates with known numbers of mini-chromosomes were selected for a simulation. These four isolates included P3 with two mini-chromosomes, B71 with one mini-chromosome, and two mini-chromosome-free isolates: T25 and Guy11 (14, 29). Random reads, from 1000 to 300 000, were subsampled from the forward read sets of the original paired-end WGS reads, with subsampling repeated five times per isolate. As expected, the variation of miniC proportions was higher when a small number of reads was used for the prediction (Figure 3). The simulation from all four isolates consistently showed that the prediction of the miniC proportion was not very reliable when the number of reads was < 20 000. However, even when using such low numbers of reads, the predicted miniC proportions did not deviate dramatically from the value obtained using the original full read set, and none of them caused a misclassification. When 50 000 or more reads were used, the predicted miniC proportions were reliably close to that using the original read set, which included millions of reads. The coefficients of variation, the ratios of the standard deviation to the mean of predicted miniC values, from five independent simulations at sampling sizes of 50 000 and above were not higher than 0.082. Based on the simulation result, 100 000 or more reads are conservatively recommended for an accurate prediction of miniC proportions using our Bi-LSTM model.
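The coefficient of variation used above is simply the standard deviation divided by the mean across the five replicate subsamples. The sketch below uses the sample standard deviation, which is an assumption since the text does not say whether the sample or population form was used, and the five miniC values are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Ratio of the sample standard deviation to the mean."""
    return stdev(values) / mean(values)


# Hypothetical miniC proportions (%) from five subsampling replicates of one isolate.
replicates = [3.4, 3.5, 3.6, 3.5, 3.5]
cv = coefficient_of_variation(replicates)
print(round(cv, 4))
```

With these illustrative values the coefficient is about 0.02, well under the 0.082 upper bound reported for sampling sizes of 50 000 reads and above.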

Applications to identify mini-chromosome-associated sequences
In addition to predicting if a strain contains mini-chromosomes, we applied the Bi-LSTM model to predict if a DNA sequence from an assembly (termed contig hereafter) represented a mini-chromosome. We split each contig into contiguous 99 bp subsequences and classified each as either core- or mini-chromosome based on the model prediction. The proportion of mini-chromosome subsequences of a contig, referred to as the miniC proportion of a contig, indicates the extent to which the contig shares similarity with mini-chromosomes. To test the prediction strategy, the genome assembly of B71, including seven core-chromosomes and one mini-chromosome, was subjected to the analysis. As a result, the miniC proportion of the B71 mini-chromosome was 54.8%, which was markedly higher than the miniC proportions of core-chromosomes, ranging from 0.1% to 1.7% (Figure 4A, Supplementary Table S8). A previous study produced draft genome assemblies for MoO FR13 (Figure 4

Figure 3. Prediction of miniC proportions using subsampled reads. MiniC proportions (Y-axis) from five simulations for each of four isolates (P3, B71, T25 and Guy11) were plotted versus the numbers of reads used (X-axis). Each gray dot represents a predicted miniC proportion using a certain number of reads randomly extracted from the original read set of the corresponding isolate. The light blue shading shows 95% confidence intervals from the five simulations. Orange horizontal dashed lines indicate the predicted miniC proportion values using the original full read sets. Brown vertical dashed lines mark the recommended minimum read number for prediction.
strated that all three contained mini-chromosomes by CHEF analysis ( 17 ).MiniC proportions of contigs from each drafted assembly were determined and used to infer if a contig was derived from a mini-chromosome.Eight contigs larger than 100 kb previously found to be mini-chromosome derived were supported by the miniC proportion data, which identified an additional five contigs ( Supplementary Table S8 ).
The five contigs appeared to possess sequence features related to mini-chromosomes.In the same study, MoT BR32 was found to contain no mini-chromosomes.Consistently, the miniC proportions of all contigs are small, ranging from 0.5% to 3.1% (Figure 4 E, Supplementary Table S8 ).Furthermore, we assembled a new MoT genome from the early isolate T3 (1986) into seven chromosomes, indicative of no mini-chromosomes.All these seven chromosomes had small miniC proportions (0.5-1.4%) and can be assigned to core-chromosomes (Figure 4 F, Supplementary Table S8 ).
Collectively, the Bi-LSTM model we constructed can be used to differentiate contigs belonging to core- or mini-chromosomes.
To scan along individual chromosomes, miniC proportions were determined for 30 kb intervals of each chromosome of B71 and T3, which carried one and zero mini-chromosomes, respectively. Almost all intervals of the B71 mini-chromosome had a miniC proportion larger than 10%. Many intervals at the ends of core-chromosomes showed a relatively high miniC proportion, indicating that they possess sequence features associated with mini-chromosomes. Notably, a region at the end of B71 chromosome 3 contained sequences with a miniC proportion level similar to that of a mini-chromosome. This region is absent in the genome of T3, which did not contain mini-chromosomes (Figure 4G and H). The region represents a potential translocation event from a mini-chromosome to a core-chromosome.
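The interval scan can be sketched as a simple binning of per-subsequence calls. The example assumes the chromosome has already been split into consecutive 99 bp subsequences, each with a boolean mini-chromosome call, and that 303 consecutive subsequences (29 997 bp) approximate one 30 kb interval; the function name `interval_minic` and the calls are hypothetical, and the authors' exact binning is not described.

```python
from typing import List


def interval_minic(calls: List[bool], sub_len: int = 99,
                   interval_bp: int = 30_000) -> List[float]:
    """MiniC proportion (%) for consecutive intervals along one chromosome.

    `calls[i]` is True when the i-th consecutive subsequence was classified
    as mini-chromosome. Assumption: each interval is approximated by
    interval_bp // sub_len subsequences; the last interval may be shorter.
    """
    per_interval = interval_bp // sub_len
    proportions = []
    for i in range(0, len(calls), per_interval):
        chunk = calls[i:i + per_interval]
        proportions.append(100.0 * sum(chunk) / len(chunk))
    return proportions


# Hypothetical calls covering two intervals: none positive, then one third positive.
calls = [False] * 303 + [True] * 101 + [False] * 202
scan = interval_minic(calls)
flagged = [i for i, p in enumerate(scan) if p > 10.0]  # intervals above 10%
print(scan, flagged)
```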

Discussion
In this study, we employed a Recurrent Neural Network (RNN) deep learning technique, specifically a Bi-directional Long Short-Term Memory (Bi-LSTM) network, to model the origin of DNA sequences as belonging to core- or mini-chromosomes. The optimized Bi-LSTM model enables examination of the core- or mini-chromosome origin using input data of WGS reads, assembled contigs or chromosomes, and DNA sequence fragments. The model was trained using multiple genomes with or without mini-chromosomes, learning genomic features from divergent core- and mini-chromosomes.
The core- and mini-chromosomes used in training were from different host-adapted pathotypes (MoO, MoT, MoL and MoE) and differed in composition. The prediction result from the Bi-LSTM model was similar to the result from CGRD, an alignment-based approach that used one reference genome, which indicated that mini-chromosomes from multiple M. oryzae pathotypes share certain learnable genomic features. In contrast to CGRD, which requires high-depth genome sequencing data, the Bi-LSTM prediction is accurate and reliable even using a very small amount of WGS read data. Also, the Bi-LSTM model is able to analyze both non-repetitive and repetitive sequences, overcoming a common problem of repetitive sequences limiting alignment-based analysis, and thereby allowing regional scanning along chromosomes. Crosstalk between core- and mini-chromosomes in M. oryzae was previously hypothesized (14). Our efforts to scan assembled chromosomes identified the end of chromosome 3 in B71 as being highly similar to mini-chromosomes. Given that this region is absent in the B71-related strain T3, this region may represent a genome structural variation arising from a translocation event from a mini-chromosome. Future analysis of more high-quality reference-level assemblies of more diverse M. oryzae strains will further illuminate potential core genome variation influenced by mini-chromosomes.
Analysis of 252 M. oryzae isolates reveals the prevalence of mini-chromosomes in at least some field isolates of all M. oryzae host-adapted pathotypes that we investigated. Specifically, 91% of 196 rice isolates were predicted to carry mini-chromosomes. The result is consistent with a previous examination of mini-chromosomes conducted using electrophoretic karyotyping, which found 93% of 14 rice isolates harbored mini-chromosomes (16). In the same study, none of seven wheat isolates carried mini-chromosomes. However, our analysis showed that 92% of wheat strains carried mini-chromosomes. The discrepancy appears to be related to the isolation period for these wheat isolates relative to the first report of wheat blast disease in 1985 in Brazil (39). Recent data indicate that both the Triticum and Lolium pathotypes evolved through two distinct episodes of sexual crosses involving individuals from five different host-adapted pathotypes, including the Eleusine pathotype (12, 40). After emergence of populations adapted to Triticum and to Lolium spp., asexual reproduction apparently predominated during infections in the field, perhaps allowing mini-chromosome accumulation (41). All isolates (T1 to T7) examined in Orbach et al. (1996) were early wheat strains collected in 1988 or earlier. From our analyses, none of the three early wheat strains (T3, T25 and BR32 collected in 1986, 1988 and 1991, respectively) carried mini-chromosomes. In contrast, all of our wheat isolates collected after 2005 carried mini-chromosomes. The result indicated that wheat strains with mini-chromosome(s) were preferentially selected in the field since the 1990s. All the Lolium isolates examined were collected since the 1990s and carried mini-chromosomes.
Among all host-adapted pathotypes analyzed, a high proportion of strains of the Eleusine (MoE) pathotype lacked mini-chromosomes.Combined with two different MoE strains analyzed in the Orbach et al. study, 78% of MoE (7 / 9) isolates contained no mini-chromosomes.MoE strains were classified to the Eleusine1 and Eleusine2 lineages previously ( 3 ).Our prediction data and previous analyses indicate that an Eleusine1 strain EI9411 and an Eleusine2 strain CD156 carried mini-chromosomes ( 17 ).Although MoE isolates frequently contained no mini-chromosomes, both lineages could carry mini-chromosomes.
Our data showing a relatively low proportion of MoE strains with mini-chromosomes support a previous report of an inverse correlation between high levels of sexual fertility and low occurrence of mini-chromosomes (16). For the ascomycetous M. oryzae, fully fertile strains are hermaphrodites that are able to serve as a female partner and produce perithecia in sexual crosses with strains of opposite mating type and also serve as male partners in crosses with other hermaphroditic strains. Orbach et al. (1996) reported that 18 fertile hermaphroditic strains, including MoE field isolates and derived fertile laboratory strains, uniformly lacked mini-chromosomes. In contrast, mini-chromosomes occur frequently in lower fertility strains, such as most rice pathogens, which either lack any mating ability or cross only as male partners with other hermaphroditic strains (16, 20). Independent studies including analysis of complete tetrads showed that mini-chromosomes fail to segregate normally in sexual crosses, typically resulting in fewer ascospore progeny with mini-chromosomes than expected (16, 42). Our results support a correlation between lack of mini-chromosomes and full female fertility since MoE strains, in general, are known to possess high levels of female fertility (1, 16, 43). In addition, the inverse association between sexual fertility and the presence of mini-chromosomes is supported by our MoO data. Our mini-chromosome prediction showed that 75% (9/12) of isolates from clade I are devoid of mini-chromosomes and a mere 2% (1/55) of isolates from clades II, III, and IV lack mini-chromosomes. This is consistent with observed reproductive characteristics because clade I includes strains that are fully fertile hermaphrodites (e.g. strain Guy11), and the predominantly asexual clades II, III and IV include infertile strains and strains that only cross as males (29, 44, 45). Further studies are needed to confirm any correlation and determine the precise role of mini-chromosomes in sexual fertility. Our mini-chromosome prediction model provides a new tool for addressing the question and tracking mini-chromosome presence in evolving populations of the blast fungus.
Our prediction model could be further improved by training on additional core- and mini-chromosome sequence data, which would support predictions across a broader range of M. oryzae isolates. Of the four mini-chromosome-bearing isolates used for model training, three were from either wheat or Lolium hosts. Because the wheat and Lolium strains are genetically close, the training samples might be biased in favor of the mini-chromosome of B71, a wheat strain. Incorporating more mini-chromosome sequence data in future model development would capture the high diversity among mini-chromosomes and thereby improve this approach. In addition to predicting the presence of mini-chromosomes, the model may be able to predict the number of mini-chromosomes in each isolate. Nevertheless, this study demonstrates the potential of deep learning techniques in genomics for predicting the presence of specific genomic elements. We anticipate that in the near future, with improved explainable deep learning techniques, the critical sequence components of mini-chromosome DNA may be identified by learning from massive genomic data, furthering our understanding of the origin and evolution of mini-chromosomes.

Figure 1. Overview of predicting sequences from mini-chromosomes. (A) Finished genome assemblies of strains with or without mini-chromosomes were used to generate training data. The training data consist of DNA fragments of around 100 bp, each labeled by its origin as either core- or mini-chromosome. The deep learning model was trained, and the optimal model was selected for predicting the origin of each sequencing read from a new strain. The miniC proportion, which is the percentage of reads predicted to originate from mini-chromosomes, is the value used to infer the presence of a mini-chromosome in the strain. (B-D) Average performance metrics for models trained using different subsequences, each of which consists of multiple k-mers. Each X-axis label specifies the size of k and the number of k-mers (e.g. 5-20 stands for a 100 bp subsequence with 20 5-mers). Performance was evaluated for all genomic data (All, blue lines) or after removing common sequences shared between core- and mini-chromosomes (reduced, orange dashed lines).
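The subsequence labels in panels B-D (e.g. 5-20) can be illustrated with a minimal Python sketch that splits a fragment into consecutive, non-overlapping k-mers and maps each k-mer to an integer token. This is an illustration under stated assumptions, not the implementation used in this study: the function names, the integer vocabulary, and the use of index 0 for k-mers containing ambiguous bases are all hypothetical.

```python
from itertools import product


def split_into_kmers(fragment: str, k: int, n_kmers: int) -> list[str]:
    """Split the first k * n_kmers bases into consecutive, non-overlapping k-mers."""
    needed = k * n_kmers
    if len(fragment) < needed:
        raise ValueError(f"fragment has {len(fragment)} bp, need {needed} bp")
    return [fragment[i:i + k] for i in range(0, needed, k)]


def kmer_vocabulary(k: int) -> dict[str, int]:
    """Map every A/C/G/T k-mer to an integer from 1 to 4**k.

    Index 0 is left free for k-mers with ambiguous bases (an assumption
    made for this sketch only).
    """
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k), start=1)}


def encode_fragment(fragment: str, k: int, n_kmers: int,
                    vocab: dict[str, int]) -> list[int]:
    """Turn a DNA fragment into a sequence of integer k-mer tokens."""
    kmers = split_into_kmers(fragment.upper(), k, n_kmers)
    return [vocab.get(kmer, 0) for kmer in kmers]


if __name__ == "__main__":
    fragment = "ACGTA" * 20              # a toy 100 bp fragment
    vocab = kmer_vocabulary(5)           # the "5-20" setting: 20 5-mers
    tokens = encode_fragment(fragment, 5, 20, vocab)
    print(len(tokens), tokens[:3])
```

A token sequence of this kind is the sort of input a recurrent model could consume one k-mer at a time; the other settings on the X-axis would correspond to different values of `k` and `n_kmers`.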

Figure 2. Mini-chromosome prediction of isolates from diverse pathotypes. (A) The miniC proportion of each isolate was estimated using whole genome sequencing reads. The index of similarity to the B71 mini-chromosome represents the proportion of the B71 mini-chromosome that was not detected as deleted genomic regions using the CGRD pipeline. Dashed lines mark the miniC proportion threshold of 1.5% and the similarity index threshold of 0.2 used to determine whether an isolate carries a mini-chromosome. Letters stand for the host species from which the strains were isolated in the field (e.g. A = Avena; C = Cenchrus; D = Digitaria; E = Eleusine; L = Lolium; O = Oryza; S = Setaria; T = Triticum and U = Urochloa). (B) Distribution of the number of isolates with and without mini-chromosomes in each pathotype. Total numbers of isolates and percentages of isolates with mini-chromosomes are labeled on top of the bars.
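As a rough illustration of how the two dashed-line thresholds in panel A could be applied, the Python sketch below computes a miniC proportion from per-read predictions and reports which thresholds an isolate exceeds. It is a sketch under assumptions: the function names are hypothetical, strict "greater than" comparisons are assumed, and the two flags are reported separately because this legend does not state how they are combined into a final call.

```python
from typing import Iterable

MINIC_THRESHOLD_PCT = 1.5    # miniC proportion threshold (%), from the legend
SIMILARITY_THRESHOLD = 0.2   # B71 similarity index threshold, from the legend


def minic_proportion(read_is_mini: Iterable[bool]) -> float:
    """Percentage of reads predicted to originate from mini-chromosomes."""
    calls = list(read_is_mini)
    if not calls:
        raise ValueError("no read predictions supplied")
    return 100.0 * sum(calls) / len(calls)


def threshold_flags(minic_pct: float, similarity_index: float) -> dict[str, bool]:
    """Report which threshold each value exceeds.

    Strict '>' is an assumption of this sketch; the flags are kept
    separate rather than merged into a single presence/absence call.
    """
    return {
        "minic_above_threshold": minic_pct > MINIC_THRESHOLD_PCT,
        "similarity_above_threshold": similarity_index > SIMILARITY_THRESHOLD,
    }


if __name__ == "__main__":
    # Toy isolate: 3 of 100 reads predicted as mini-chromosome reads.
    pct = minic_proportion([True] * 3 + [False] * 97)
    print(pct, threshold_flags(pct, similarity_index=0.1))
```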

Figure 4. Prediction of miniC proportions in assembled contigs or chromosomes. (A-F) Assembled contigs or chromosomes of six strains, including five strains that were not included in the training data, were subjected to prediction. The miniC proportion represents the proportion of sequences in each contig/chromosome predicted as miniC sequences. Y-axes show the names of contigs/chromosomes, which are listed in the 'contig_id' column of Supplementary Table S8. The same rule applies to other contig names. The size of each blue dot indicates the length (bp) of the contig/chromosome. (G, H) MiniC proportions per 30-kb interval (purple) and proportions of genic sequences per interval along each chromosome (gray). Orange curves represent LOWESS estimates of genic proportions. The red arrow indicates a core-chromosome region with a high miniC proportion value. B71 genome version: B71Ref2; T3 genome version: T3v1.
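The per-interval values in panels G and H can be sketched as a simple binning step: each predicted subsequence is assigned to a 30-kb interval by its start position, and the fraction predicted as miniC is computed per interval. The Python sketch below is illustrative only; the function name, the 0-based start coordinates, and the reporting of proportions as fractions between 0 and 1 are assumptions rather than details taken from this study.

```python
from collections import defaultdict
from typing import Iterable


def windowed_minic(predictions: Iterable[tuple[int, bool]],
                   window: int = 30_000) -> dict[int, float]:
    """Fraction of subsequences predicted as miniC in each interval.

    `predictions` holds (start_position, is_mini) pairs for subsequences
    along one chromosome or contig; positions are assumed to be 0-based.
    Returns a mapping from interval start coordinate to miniC fraction.
    Intervals without any subsequence are omitted.
    """
    counts = defaultdict(lambda: [0, 0])  # interval index -> [mini, total]
    for start, is_mini in predictions:
        index = start // window
        counts[index][0] += int(is_mini)
        counts[index][1] += 1
    return {
        index * window: mini / total
        for index, (mini, total) in sorted(counts.items())
    }


if __name__ == "__main__":
    toy = [(0, True), (100, False), (29_999, True),
           (30_000, False), (45_000, False)]
    for interval_start, fraction in windowed_minic(toy).items():
        print(interval_start, round(fraction, 3))
```

Plotting these values against interval start coordinates would give a track comparable in form to the purple series, and a core-chromosome region with an unusually high value would stand out in the same way as the region marked by the red arrow.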

Table 1. Summary of genome assembly data for model training